Before you go ahead and run the code in this coursebook, it’s a good idea to go through some initial setup. Under the Libraries and Setup tab you’ll find code to initialize our workspace, along with the libraries we’ll be using for the projects. You may want to make sure that these libraries are installed beforehand by referring to the packages listed there. Under the Training Focus tab we’ll outline the syllabus, identify the key objectives and set expectations for each module.

1 Background

1.1 Algoritma

The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc.

1.2 Libraries and Setup

We’ll set up caching for this notebook given how computationally expensive some of the code we will write can get. The chunk below also disables scientific notation in printed output and clears the workspace:

options(scipen = 9999)
rm(list=ls())
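The caching itself is typically enabled through a knitr chunk option; the setup chunk is not shown in this coursebook, but a minimal sketch (an assumption about the author's R Markdown setup, not code from the coursebook) would be:

```r
# Enable caching for all chunks in this R Markdown document,
# so expensive chunks are re-run only when their code changes
knitr::opts_chunk$set(cache = TRUE)
```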

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
## Loading required package: magrittr
library(leaflet)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

The data we’ll be working on is a rather recent YouTube Trending Videos dataset[1]. It has 13,400 records of trending videos between 14th November 2017 and 21st January 2018, and each record carries the following variables:

General information relating to the video:
- Trending Date
- Title (video title)
- Channel Title
- Category ID
- Publish Time
- Comments Disabled?
- Ratings Disabled?
- Video Error or Removed?

Statistics on a particular date:
- Views
- Likes
- Dislikes
- Comment Count

Illustration of a “Trending” section on YouTube:

1.3 Training Objectives

The primary objective of this course is to provide a fun and hands-on session to help participants gain full proficiency in data visualization systems and tools. You will learn to create compelling narratives by combining charting elements with a rich grammar under the guidance of the lead instructor and our team of teaching assistants.

  • Plotting Essentials
      • Revision: Built-in Plotting Functionalities
      • Revision: Scatterplot, Histogram, Line and Column Bars
      • Axis, Title and Panel Styles
      • Grammar of Graphics
      • ggplot2 Basics

  • Plotting Better
      • Using Themes
      • Multi-dimensional Faceting
      • Visualizing Geo-Spatial Data with leaflet
      • Lattice Plotting System

By the end of the workshop, Academy students can choose to complete either of the Learn-By-Building modules as their graded assignment:

Ready for Publication
Applying what you’ve learned, create a visualization that is polished with the appropriate annotations, aesthetics and some simple commentary. This can be any visualization using the YouTube dataset, but it should communicate a story.

Interactive Map
Create a web page with an interactive map embedded on it. Use a custom icon for the map markers to represent business locations, and show details about each location pin (“markers”) upon user’s interaction.

This graded assignment is worth 2 points.

2 Plotting Essentials

R as a statistical computing environment packs a generous amount of tools allowing us to reshape, clean and visualize our data through its built-in capabilities. In the first part of this coursebook, we’ll take a look at many of these capabilities and learn how to incorporate these into our day-to-day data science work.

In the second part of this coursebook, we’ll shift our focus onto ggplot2, a plotting system by Hadley Wickham. As you’ll see in this 3-day workshop, this plotting system is among the most popular visualization tools today because of its power, extensibility and simplicity (an unlikely combination).

To get started with plotting in R, let’s start by reading our data into the environment:

vids <- read.csv("USvideos.csv")
names(vids)
##  [1] "trending_date"          "title"                 
##  [3] "channel_title"          "category_id"           
##  [5] "publish_time"           "views"                 
##  [7] "likes"                  "dislikes"              
##  [9] "comment_count"          "comments_disabled"     
## [11] "ratings_disabled"       "video_error_or_removed"

vids is a dataframe with 13,400 observations and 12 variables. The trending_date variable is stored as a factor and we’ll have to convert it to a Date object. In previous workshops, you’ve learned about transformation functions such as as.character(), as.numeric() and as.Date(), but here I’ll show you another way of working with dates, through the use of lubridate. lubridate is an R package that makes it easier to work with dates and times. Because trending_date currently stores its values in the yy.dd.mm format, all we need to do is wrap it in ydm() like so:

vids$trending_date <- ydm(vids$trending_date)

The raw dataset does not store the proper name for each category, identifying each one by an “id” instead. The following code chunk maps each id to its category name and converts the result to a factor. We will also convert our video titles to a character vector:

vids$title <- as.character(vids$title)
vids$category_id <- sapply(as.character(vids$category_id), switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")
vids$category_id <- as.factor(vids$category_id)

And with that, the next thing we’ll need to do is convert the publish_time variable into a date-time class object. Because we’re analyzing trending YouTube videos in the US, it makes sense that we use a timezone such as New York’s for our analysis:

head(vids$publish_time)
## [1] 2017-11-13T17:13:01.000Z 2017-11-13T07:30:00.000Z
## [3] 2017-11-12T19:05:24.000Z 2017-11-13T11:00:04.000Z
## [5] 2017-11-12T18:01:41.000Z 2017-11-13T19:07:23.000Z
## 2903 Levels: 2008-04-05T18:22:40.000Z ... 2018-01-21T05:44:30.000Z
vids$publish_time <- ymd_hms(vids$publish_time,tz="America/New_York")
## Date in ISO8601 format; converting timezone from UTC to "America/New_York".

Observe how simply lubridate works with dates and times. In fact, when we use ymd(), ymd_hms() or one of their variants, these functions recognize the patterns and identify the right separators as long as the order of the components is correct. These functions will even parse dates correctly when the input contains differently formatted dates!
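As a small illustration (toy values, not part of the coursebook’s dataset), ymd() happily parses the same year-month-day order written with different separators:

```r
library(lubridate)

# Different separators and formats, same year-month-day order
ymd(c("2017-11-14", "2017/11/15", "20171116"))
# All three parse to Date objects: "2017-11-14" "2017-11-15" "2017-11-16"
```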

Let’s see a few more things that lubridate can do. I’ll subset the data for the most popular trending video (by number of views) and we’ll extract information from the trending_date of the most popular trending video:

most <- vids[vids$views == max(vids$views),]
year(most$trending_date)
## [1] 2017
month(most$trending_date)
## [1] 12
day(most$trending_date)
## [1] 14

We will also go ahead and create three new variables for our data frame, storing the hours, period of the day, and the day of the week of each video at the time of publish:

vids$publish_hour <- hour(vids$publish_time)

pw <- function(x){
    if(x < 8){
      x <- "12am to 8am"
    }else if(x >= 8 & x < 16){
      x <- "8am to 3pm"
    }else{
      x <- "3pm to 12am"
    }  
}

vids$publish_when <- as.factor(sapply(vids$publish_hour, pw))
vids$publish_wday <- as.factor(weekdays(vids$publish_time))

While publish_wday is now a factor, we can also order its levels so our plots later will display them in our desired order:

vids$publish_wday <- ordered(vids$publish_wday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

We’ll also go ahead and convert some of the variables into numeric variables as and where appropriate:

vids[,c("views", "likes", "dislikes", "comment_count")] <- lapply(vids[,c("views", "likes", "dislikes", "comment_count")], as.numeric)

Hopefully up to this point, none of the above data transformation and cleansing process looks too unfamiliar for you! If you do need a refresher, refer to the Programming for Data Science coursebook - we’re really applying many of the same ideas to a new dataset, and so you should feel somewhat comfortable up to this point of the course :)

vids has 13,400 records of trending videos, but many videos were trending for several days, so we really only have a collection of 2,986 unique videos. On a very broad average, each video was trending for ~4.5 days.
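These figures can be checked directly against the data (a quick sketch using the vids dataframe prepared above):

```r
nrow(vids)                               # total records: 13400
length(unique(vids$title))               # unique videos: 2986
nrow(vids) / length(unique(vids$title))  # average days trending: ~4.5
```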

Let’s create a dataframe, call it vids.u, that takes only the first observation of each title within the data. match() returns a vector of the positions of (first) matches of its first argument in its second:

vids.u <- vids[match(unique(vids$title), vids$title),]

We’ll also create one more variable timetotrend to measure the time it takes for a video to become “trending”:

vids.u$timetotrend <- vids.u$trending_date - as.Date(vids.u$publish_time)
vids.u$timetotrend <- as.factor(ifelse(vids.u$timetotrend <= 7, vids.u$timetotrend, "8+"))

With these done, we’ll move into the exciting part of this workshop: plotting!

2.1 Base Plotting and Statistical Plots

Statistical plots help us visually inspect our dataset, and there are numerous ways to create them in R. The simplest is through the plot() function. In the following code we pass two vectors to plot() and let it pick a sensible chart type:

plot(as.factor(vids.u$publish_hour), vids.u$likes/vids.u$views)

The above gives us a boxplot that compares the “likes ratio” across different periods of the day. We want to observe whether there is any correlation between the likes-to-views percentage and the period of time when the video was published. As expected, we did not find any obvious patterns, in part because this compares the “likes” a video has on the first day of it being “trending” against the hour when it was published onto YouTube. In a sense, any kind of effect the hour variable can have has been adjusted for (or significantly reduced) by the noise between these two events.

plot() knows how to pick sensible defaults based on the input vector it was given. To illustrate this point, I’ll subset the data to take only trending videos within the Autos, Gaming and Travel categories:

vids.agt <- vids.u[vids.u$category_id == "Autos and Vehicles" | vids.u$category_id == "Gaming" | vids.u$category_id == "Travel and Events", ]

And notice that as we call plot now on that subset, instead of a boxplot, the plot() function knows that we’re plotting two numeric variables and creates a scatterplot for us instead:

plot(vids.agt$likes, vids.agt$dislikes)

We’ll drop the empty levels from our category_id variable, and also create two new variables that measure the likes and dislikes per video view for each observation:

vids.agt$category_id <- factor(vids.agt$category_id)
vids.agt$likesp <- vids.agt$likes/vids.agt$views
vids.agt$dislikesp <- vids.agt$dislikes/vids.agt$views

Our earlier scatterplot really isn’t very informative or even pleasant to look at. The key, as it is with data visualization in general, is to make our plot effective. An effective plot complements how human visual perception works. This scatterplot is ineffective because most of its points crowd the bottom left, producing “visual clutter” while communicating little.

An approach to fix that is by coloring the plots. In the following code chunk, the first line is identical to the code that produces the scatterplot above except for one addition, the col (color) parameter. We mapped the color parameter to the category so the points are colored accordingly.

I’ve also added a dashed line (lty=2) with a width of 2 (lwd=2) to show the correlation between the likes-to-views and dislikes-to-views ratios.

Finally, I added a legend for our plot to show how the colors of our scatterplot points map to each level of our category_id variable.

Here’s the code:

plot(vids.agt$likesp, vids.agt$dislikesp, col=vids.agt$category_id, pch=19)
abline(lm(vids.agt$dislikesp ~ vids.agt$likesp), col=8, lwd=2, lty=2)
legend("topright", legend=levels(vids.agt$category_id), fill=1:3)

With plot, the default for two numerical variables is to plot a scatterplot, but we can override the default parameters with the type argument. The following code chunk is identical to the one above, except for the type="h" addition:

plot(vids.agt$likesp, vids.agt$dislikesp, col=vids.agt$category_id, type="h")
abline(lm(vids.agt$dislikesp ~ vids.agt$likesp), col=8, lwd=2, lty=2)
legend("topright", legend=levels(vids.agt$category_id), fill=1:3)

Apart from using plot(), we can also create statistical plots using functions such as hist(). hist() takes a numeric vector and creates a histogram:

hist(vids.agt$likesp)

We can additionally use the breaks argument to control the number of bins if we were not satisfied with the default values:

hist(vids.agt$likesp, breaks=20)

Just like how we can add abline onto our plot, we can add graphical elements like lines onto this histogram too. In fact, let’s do that and also use the main argument to give our plot a new main title:

hist(vids.agt$likesp, breaks=20, ylim=c(0, 20), col="lightblue", main="Distribution of likes-per-view")
lines(density(vids.agt$likesp), col="darkblue")

While base plots can be very simple to produce, they can be effective too. In fact, with proper coloring, annotation and a little care on the aesthetic touches, you can communicate a lot in a graph using just R’s built-in plotting system.

In the following code chunk I’m subsetting from vids.agt only trending videos that have more than 10,000 likes and order it by the likes-to-view variable. I added a new variable, col to this new dataframe to be used in my following plot:

vids.ags <- vids.agt[vids.agt$likes > 10000, ]
vids.ags <- vids.ags[order(vids.ags$likesp), ]

# create color specifications for our dotchart
vids.ags$col[vids.ags$category_id == "Autos and Vehicles"] <- "goldenrod4"
vids.ags$col[vids.ags$category_id == "Gaming"] <- "dodgerblue4"
vids.ags$col[vids.ags$category_id == "Travel and Events"] <- "firebrick4"

We’re going to create a dot chart (or a Cleveland’s Dot Plot) by graphing the likes to view ratio of each trending video in the vids.ags dataframe, map channel_title to the labels and group these labels by category_id.

dotchart(vids.ags$likesp, labels=vids.ags$channel_title, cex=.7, pch=19, groups=vids.ags$category_id, col=vids.ags$col)

With this we see that, between groups, the likes-to-views proportion (we’ll call it “likeability” from now on) of Autos and Vehicles is rather similar to that of Travel and Events. For gaming videos, however, we see a larger variance: the top video by likeability is close to 0.13, more than six times that of other trending videos in this category. We can also observe the rough mean likeability within groups, as well as between them. We would expect Travel and Events videos to have ~4 likes per 100 views, and Gaming videos to have more than that due to the positive skew we observe above.

Let’s talk about another kind of plot, one that most statisticians find cringeworthy for its undeserved popularity and prevalence in the workplace. Yes, it is the pie chart. In R’s official documentation, the pie chart is criticized as being “a very bad way of displaying information [because] the eye is good at judging linear measures and bad at judging relative areas”. Almost any data that can be represented in a pie chart can be illustrated with a bar chart or dot chart[2].

If you insist on creating one, here’s the code (I’ve added some colors to make it easier to get a grasp of the measures):

pie(table(vids.ags$publish_hour), labels=names(table(vids.ags$publish_hour)), col=topo.colors(24))

2.2 Grammar of Graphics in R

2.2.1 The motivation of ggplot2

ggplot2 was created by Hadley Wickham in 2005 as an implementation of Leland Wilkinson’s Grammar of Graphics. The idea behind the Grammar of Graphics is to create a formal model for data visualization by breaking graphics down into components that can be systematically added or subtracted by the end user.

With ggplot2, plots may be created using qplot(), where arguments and defaults are handled similarly to the base plotting system, or through ggplot(), where the user can add or alter plot components layer by layer with a high level of modularity.

The last point is especially important because it allows data scientists to work with plots in a system that breaks up these different tasks. Instead of a huge, conceptually flat list of parameters to control every aspect of the plot’s final outcome, this system makes plotting a series of distinct tasks, each focused on one aspect of the plot’s final output.

Let us take a look at a simple example, drawing inspiration from the Earthquake incident that happened in the south of Jakarta this week (as of this writing). I’ve created a dataframe called gempa:

gempa <- data.frame(
  x=c(3.5,3,4,4.5,4.1),
  y=c(12,14,12.4,12.5,14), 
  size=c(14,4,4,6,12)
)

And now I’ll create a ggplot object using ggplot(). Because of the range of my values, this plot will use them to set up the scales (scales, by the way, can be thought of for now as just the two axes). Note that we’re just creating a blank plot with no geometry elements (lines, points, etc.) on it yet. We save this object as g:

g <- ggplot(gempa, aes(x = x, y = y))
g

Notice how ggplot() takes two arguments:
- The data
- The aes() call, which allows us to specify our mapping of the x and y variables so they are used accordingly by ggplot

Once we created our ggplot object (we named it g), we can now add a layer onto it using geom_. geom is ggplot’s way of handling geometry, i.e. how the data are represented on the plot. To illustrate this idea, let’s add a geom_point and then print the resulting object:

g + geom_point()

A recap of what we’ve done so far:

  • Creating our ggplot graphics object through ggplot()
  • We specify 2 arguments in our call to ggplot; It’s helpful to note that any argument we pass into ggplot() will be used as global options for the plot, i.e. they apply to all layers we add onto that graphics object
  • For the second argument we use the aes() function, allowing us to map variables from our gempa data to aesthetic properties of the plot (in our case we map them to the x and y axis)
  • We tell ggplot how we want the graphic objects to be represented by adding (through the “+” operator) our geom layer. Since we added geom_point, this is equivalent to adding a layer of scatterplot to represent our x and y variables
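One practical consequence of the global-versus-layer distinction: aesthetics passed to ggplot() apply to every layer, while aesthetics passed to a geom apply to that layer only. A small sketch using our gempa dataframe:

```r
# x and y are global mappings (inherited by every layer added to the plot);
# size is mapped only within geom_point(), so it affects only the points
ggplot(gempa, aes(x = x, y = y)) +
  geom_point(aes(size = size))
```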

As we familiarize ourselves with this system, we will learn to use other functions to obtain more precise control over the construction of our plot. These could be native ggplot constructs such as scales, legends, geoms and thematic elements, or additional constructs from third-party packages that work with ggplot. In the following example, we’re adding background_image() (from the ggpubr package) to our original plot (g) before adding geom_point on top of the background image layer. Finally, we add the labels for our title and caption using the labs() function:

library(png)
jak <- png::readPNG('jakarta.png')
g +
  background_image(jak)+
  geom_point(size=gempa$size, alpha = 0.6, col="red2")+
  labs(title="Disaster Impact Zone, Invasion Area 2018", caption="source: Jakarta Disaster Relief")

Because of this design philosophy, ggplot presents a learning curve that is beginner-friendly and mostly logical. I say beginner-friendly because, as we will see later, all we need to do is master the starting steps first and not worry about polishing. The starting steps are:
1. ggplot() with data and aesthetics mapping (aes)
2. Add to (1) a single geom layer

2.4 Learn-by-building Module: Data Visualization

To help us get ready for the learn-by-building assignment, we’ll walk through a simple exercise together. There isn’t any new concept being introduced, but rather a start-to-finish recap of the data visualization process.

First, we’ll subset the videos within the News and Politics category and take only the channels that have more than 10 trending videos during the observation period. We run the following code and see that 6 channels / publishers satisfy that condition:

news <- vids.u[vids.u$category_id == "News and Politics", ]
news <- aggregate(trending_date ~ channel_title, news, length)
news <- news[news$trending_date > 10, ]
news <- news[order(news$trending_date, decreasing = T),]
news
##      channel_title trending_date
## 83             Vox            29
## 30             CNN            17
## 10        BBC News            12
## 84 Washington Post            12
## 40   Guardian News            11
## 78           TODAY            11

We can now use vids.u[vids.u$channel_title %in% news$channel_title, ] as our data source, indicating that we only wish to create our ggplot using channels that are in the list of 6 news channels above. The rest of the code is relatively straightforward:
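If %in% is new to you, it returns a logical vector marking which elements of its left operand appear in its right operand (a toy example, unrelated to our dataset):

```r
# Membership test: which left-hand elements appear on the right?
c("Vox", "CNN", "MTV") %in% c("Vox", "CNN", "BBC News")
# [1]  TRUE  TRUE FALSE
```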

ggplot(data=vids.u[vids.u$channel_title %in% news$channel_title,], aes(x=publish_time, y=views))+
  geom_point(aes(col=log(likes/dislikes), size=comment_count))+
  facet_grid(publish_wday~channel_title)+
  scale_color_gradient(low="red3", high="green2")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))

We don’t always need a facet plot if the information we wish to convey doesn’t require it. The following plot does not tell us the sentiment or comment count of each video, but does enough to tell us about each channel’s representation among the trending videos.

news.agg <- vids.u[vids.u$channel_title %in% news$channel_title, ]
news.agg <- aggregate(views ~ channel_title, news.agg, mean)
names(news.agg) <- c("channel_title", "mean_views")
news.agg
##     channel_title mean_views
## 1        BBC News   170525.0
## 2             CNN   274701.6
## 3   Guardian News   328777.5
## 4           TODAY   194893.6
## 5             Vox   517597.6
## 6 Washington Post   109853.3

The plot:

ggplot(data=vids.u[vids.u$channel_title %in% news$channel_title,], aes(x=publish_time, y=views))+
  geom_hline(data=news.agg, aes(yintercept=mean_views, col=channel_title), linetype=2, alpha=0.8)+
  geom_point(aes(size=likes/views, col=channel_title), stroke=1, alpha=0.85)+
  guides(size=F)+
  labs(title="Clash of the Media Giants", x="Date", y="Video Views", subtitle="Vox leads in terms of quantity (most videos represented) and quality (most views on average)")+
  theme(legend.position = "bottom")

2.4.1 Using Themes

We can spice up our visualizations using another nifty feature of ggplot: themes! I’ve copied and pasted the code from our earlier exercise and added a theme using theme_linedraw(). I invite you to go ahead and swap out the theme and replace it with one of the other themes. Examples:
- theme_calc()
- theme_excel()
- theme_gdocs()
- theme_classic()

# Note: temp1 is not constructed in this excerpt; it is assumed to be a
# frequency table of channel titles, e.g.:
# temp1 <- as.data.frame(sort(table(vids.u$channel_title), decreasing = TRUE)[1:10])
ggplot(temp1, aes(x=Var1, y=Freq))+
  geom_col()+
  theme_linedraw()+
  labs(title="Most Represented Channels among Trending Videos")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now try one more example: replace theme_wsj() with one of the other themes in the following plot. A simple switch of the overall presentation “theme” using a short theme_ function makes it easy to experiment with different aesthetics, and is yet another benefit of using ggplot!

ggplot(data = vids.ags, aes(x=category_id, y=dislikes, size=comment_count/views))+
  geom_jitter(aes(col=publish_wday))+
  theme_wsj(base_size = 8)+
  labs(title = "Trending Videos Between 3 Categories")+
  theme(legend.position = "none")

2.5 Introduction to Leaflet [Optional]

Leaflet is among the most popular JavaScript libraries for interactive maps, used by websites such as The New York Times, The Washington Post, GitHub and Flickr[3]. The R package leaflet allows us to create Leaflet maps directly from R code. The steps are as follows:

  1. Create a map widget by calling leaflet().
  2. Add layers (i.e., features) to the map by using layer functions (addTiles, addMarkers, addPolygons) to modify the map widget.

Sounds similar enough to the ggplot system? Let’s see a simple example.

I’m going to create two objects to be used for our Leaflet map later. First, an icon! Here I’m using Algoritma’s main icon and saving it to an object called ico. Next, we’ll create loca, a data frame that has two variables (lat and lng) with some randomly generated numbers. The code is straightforward:

set.seed(418)
library(leaflet)

ico <- makeIcon(
    iconUrl = "https://algorit.ma/wp-content/uploads/2017/07/logo_light_trans.png",
    iconWidth=177/2, iconHeight=41/2
)


loca <- data.frame(lat=runif(5, min = -6.24, max=-6.23),
                   lng=runif(5, min=106.835, max=106.85))

And we’ll now create our map:

# create a leaflet map widget
map1 <- leaflet()

# add tiles from open street map
map1 <- addTiles(map1)

# add markers
map1 <- addMarkers(map1, data = loca, icon=ico)

map1

Suppose we want the end user to be able to click on each of these icons and see a simple pop-up description; we can add that to our map too!

Create the pop-up text:

pops <- c(
    "<h3>Algoritma Main HQ</h3><p>Visit us here!</p>",
    "<strong>Algoritma Business Campus</strong>", 
    "<h3>In-Construction</h3><p>New Secondary Campus</p>",
    "<strong>Secondary Campus</strong>",
    "<strong>The Basecamp (business-school)</strong>"
)

Adding them to our map:

map1 <- leaflet()
map1 <- addTiles(map1)
map1 <- addMarkers(map1, data = loca, icon=ico, popup = pops)

map1
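As an aside, the same widget can be written more compactly with the magrittr pipe (%>%), which was loaded earlier as a dependency of ggpubr; this is equivalent to the step-by-step version above:

```r
library(leaflet)

# Build the widget, add the OpenStreetMap tiles and the markers in one chain
map1 <- leaflet() %>%
  addTiles() %>%
  addMarkers(data = loca, icon = ico, popup = pops)

map1
```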

leaflet does a lot more than the simple demonstration above, but since it belongs to the optional part of this coursebook, I’ll leave it to you, the reader, to further explore its possibilities! While I would recommend using ggplot as the main focus of your graded assignment, I want to leave the choice up to you. Work with your academic mentors to produce a visualization as specified in the learn-by-building module, and good luck!

3 Summary

The coursebook covers many aspects of plotting, including using visualization libraries such as ggplot2, leaflet and a few other supporting libraries. I hope you’ve managed to get a good grasp of the plotting philosophy behind ggplot2, and have built a few visualizations with it yourself!

Happy coding!

Samuel

3.1 Annotations

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.


  1. This dataset was first contributed under CC0 public domain by Mitchell J (datasnaek on Kaggle), and is maintained by other contributors. It has 13,400 records of trending videos between 14th November 2017 and 21st January 2018.

  2. Full note on pie charts from the official R documentation:
    “Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”

  3. Official Documentation, Leaflet